Adaptive Quantization for Deep Neural Network

Authors

  • Yiren Zhou
  • Seyed-Mohsen Moosavi-Dezfooli
  • Ngai-Man Cheung
  • Pascal Frossard
Abstract

In recent years, Deep Neural Networks (DNNs) have been rapidly developed for various applications, with increasingly complex architectures. The performance gains of these DNNs generally come with high computational costs and large memory consumption, which may not be affordable for mobile platforms. Deep model quantization can be used to reduce the computation and memory costs of DNNs and to deploy complex DNNs on mobile equipment. In this work, we propose an optimization framework for deep model quantization. First, we propose a measurement to estimate the effect of parameter quantization errors in individual layers on the overall model prediction accuracy. Then, we propose an optimization process based on this measurement for finding the optimal quantization bit-width for each layer. This is the first work that theoretically analyses the relationship between the parameter quantization errors of individual layers and model accuracy. Our new quantization algorithm outperforms previous quantization optimization methods, and achieves a 20-40% higher compression rate compared to equal bit-width quantization at the same model prediction accuracy.

Introduction

Deep neural networks (DNNs) have achieved significant success in various machine learning applications, including image classification (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 2014; Szegedy et al. 2015), image retrieval (Hoang et al. 2017; Do, Doan, and Cheung 2016), and natural language processing (Deng, Hinton, and Kingsbury 2013). These achievements come with increasing computational and memory costs, as the networks become deeper (He et al. 2016) and contain more filters per layer (Zeiler and Fergus 2014). While DNNs are powerful for various tasks, the increasing computational and memory costs make them difficult to apply on mobile platforms, considering the limited storage space, computation power, and energy supply of mobile devices (Han, Mao, and Dally 2015), as well as the real-time processing requirements of mobile applications. There is clearly a need to reduce the computational resource requirements of DNN models so that they can be deployed on mobile devices (Zhou et al. 2016).

In order to reduce the resource requirements of DNN models, one approach relies on model pruning. By pruning some parameters of the model (Han, Mao, and Dally 2015), or skipping some operations during evaluation (Figurnov et al. 2016), the storage space and/or the computational cost of DNN models can be reduced. Another approach consists in parameter quantization (Han, Mao, and Dally 2015). By applying quantization to the model parameters, these parameters can be stored and computed at lower bit-widths; the model size is reduced, and the computation becomes more efficient with hardware support (Han et al. 2016). It is worth noting that model pruning and parameter quantization can be applied at the same time without interfering with each other (Han, Mao, and Dally 2015); both approaches can be combined to achieve higher compression rates. Many deep model compression works have also considered parameter quantization (Gupta et al. 2015; Han, Mao, and Dally 2015; Wu et al. 2016) together with other compression techniques, and achieve good results. However, these works usually assign the same quantization bit-width to the different layers of the deep network, even though the layers of a DNN have different structures, which leads to different properties with respect to quantization.
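For reference, the parameter quantization discussed above maps each layer's real-valued weights onto a low bit-width grid. The sketch below shows a simple per-layer uniform quantizer; the function names and the symmetric step-size choice are illustrative assumptions, not the scheme of any specific work cited here.

```python
import numpy as np

def uniform_quantize(weights, bit_width):
    """Uniformly quantize a weight tensor onto a symmetric bit_width-bit grid.

    Illustrative sketch only: a real pipeline must also store the step size,
    and may use asymmetric ranges or trained codebooks instead.
    """
    levels = 2 ** (bit_width - 1) - 1          # e.g. 127 representable magnitudes at 8 bits
    step = np.max(np.abs(weights)) / levels    # per-layer scale derived from the weight range
    codes = np.clip(np.round(weights / step), -levels, levels)
    return codes.astype(np.int32), step

def dequantize(codes, step):
    """Map integer codes back to real-valued weights."""
    return codes.astype(np.float32) * step

# Example: quantize one (random) layer to 6 bits and check the worst-case error.
w = np.random.randn(256, 128).astype(np.float32)
codes, step = uniform_quantize(w, bit_width=6)
w_hat = dequantize(codes, step)
print("max |w - w_hat|:", float(np.max(np.abs(w - w_hat))))
```

Storing the integer codes plus one scale per layer is what shrinks the stored model from 32-bit floating point to roughly bit_width bits per parameter.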
By applying the same quantization bit-width to all layers, the results can be sub-optimal. It is, however, possible to assign different bit-widths to different layers to achieve a better quantization result (Hwang and Sung 2014). In this work, we propose an accurate and efficient method to find the optimal bit-width for coefficient quantization in each DNN layer. Inspired by the analysis in (Fawzi, Moosavi-Dezfooli, and Frossard 2016), we propose a method to measure the effect of parameter quantization errors in individual layers on the overall model prediction accuracy. Then, by combining the effects caused by all layers, the optimal bit-width is decided for each layer. This method avoids an exhaustive search for the optimal bit-width of each layer and makes the quantization process more efficient. We apply this method to quantize different models that have been pre-trained on the ImageNet dataset and achieve good quantization results on all models. Our method consistently outperforms the recent state of the art, i.e., the SQNR-based method (Lin, Talathi, and Annapureddy 2016), on different models, and achieves a 20-40% higher compression rate compared to equal bit-width quantization. Furthermore, we give a theoretical analysis of how quantization of the layers affects DNN accuracy. To the best of our knowledge, this is the first work that theoretically analyses the relationship between the coefficient quantization effect of individual layers and DNN accuracy.

Related works

Parameter quantization has been widely used for DNN model compression (Gupta et al. 2015; Han, Mao, and Dally 2015; Wu et al. 2016). The work in (Gupta et al. 2015) limits the bit-width of DNN models for both training and testing, and proposes a stochastic rounding scheme for quantization to improve model training performance under low bit-width. The authors in (Han, Mao, and Dally 2015) use k-means to train quantization centroids and use these centroids to quantize the parameters. The authors in (Wu et al. 2016) separate the parameter vectors into sub-vectors and find a sub-codebook for each sub-vector for quantization. In these works, all (or a majority of) layers are quantized with the same bit-width. However, as the layers in a DNN have various structures, they may have different properties with respect to quantization, and it is possible to achieve better compression by optimizing the quantization bit-width for each layer.

Previous works have addressed the optimization of quantization bit-widths for DNN models (Hwang and Sung 2014; Anwar, Hwang, and Sung 2015; Lin, Talathi, and Annapureddy 2016; Sun, Lin, and Wang 2016). The authors in (Hwang and Sung 2014) propose an exhaustive search approach to find the optimal bit-width for a fully-connected network. In (Sun, Lin, and Wang 2016), the authors first use exhaustive search to find the optimal bit-width for uniform or non-uniform quantization; then two schemes are proposed to reduce memory consumption during model testing. The exhaustive search approach only works for relatively small networks with few layers; it is not practical for deep networks, as the complexity of the search grows exponentially with the number of layers. The authors in (Anwar, Hwang, and Sung 2015) use the mean square quantization error (MSQE, i.e., the L2 error) on layer weights to measure the sensitivity of DNN layers to quantization, and manually set the quantization bit-width for each layer.
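Both of these weight-domain criteria, the MSQE just mentioned and the SQNR discussed next, can be computed directly from the original and quantized weights of a layer. The sketch below is a hedged illustration of the measurements themselves, not of either paper's bit-width selection procedure.

```python
import numpy as np

def msqe(w, w_hat):
    """Mean square quantization error (the L2 criterion) on a layer's weights."""
    return float(np.mean((w - w_hat) ** 2))

def sqnr_db(w, w_hat):
    """Signal-to-quantization-noise ratio in dB: weight power over quantization-noise power."""
    signal_power = float(np.mean(w ** 2))
    noise_power = float(np.mean((w - w_hat) ** 2))
    return 10.0 * float(np.log10(signal_power / noise_power))

# Toy check: coarser quantization of the same weights raises MSQE and lowers SQNR.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)
for bits in (4, 6, 8):
    levels = 2 ** (bits - 1) - 1
    step = np.max(np.abs(w)) / levels
    w_hat = np.clip(np.round(w / step), -levels, levels) * step
    print(f"{bits} bits  MSQE={msqe(w, w_hat):.2e}  SQNR={sqnr_db(w, w_hat):.1f} dB")
```

As the surrounding discussion points out, these metrics quantify distortion in the weights only; they do not by themselves indicate how much each layer's distortion degrades the final prediction accuracy.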
The work in (Lin, Talathi, and Annapureddy 2016) uses the signal-to-quantization-noise ratio (SQNR) on layer weights to measure the effect of the quantization error in each layer. MSQE and SQNR are good metrics for measuring the quantization loss on model weights. However, there is no theoretical analysis showing how these measurements relate to the accuracy of the DNN model; only empirical results are shown. The MSQE-based approach in (Anwar, Hwang, and Sung 2015) minimizes the L2 error on the quantized weights, implying that the L2 error in different layers has an equal effect on model accuracy. Similarly, in (Lin, Talathi, and Annapureddy 2016), the authors maximize the overall SQNR and suggest that quantization of different layers contributes equally to the overall SQNR, and thus has an equal effect on model accuracy. Both works ignore that the different structures and positions of the layers may lead to different robustness to quantization, which renders the two approaches suboptimal.

In this work, we follow the analysis in (Fawzi, Moosavi-Dezfooli, and Frossard 2016) and propose a method to measure the effect of the quantization error in each DNN layer. Different from (Anwar, Hwang, and Sung 2015; Lin, Talathi, and Annapureddy 2016), which use empirical results to show the relationship between the measurement and DNN accuracy, we conduct a theoretical analysis to show how our proposed measurement relates to model accuracy. Furthermore, we show that our bit-width optimization method is more general than the method in (Lin, Talathi, and Annapureddy 2016), which makes our optimization more accurate. There are also works (Hinton, Vinyals, and Dean 2015; Romero et al. 2014) that use knowledge distillation to train a smaller network from the original complex model. It is also possible to combine our quantization framework with knowledge distillation to achieve yet better compression results.

Measuring the effect of quantization noise

In this section, we analyse the effect of quantization on the accuracy of a DNN model. Parameter quantization results in quantization noise that affects the performance of the model. Previous work has analysed the effect of input noise on a DNN model (Fawzi, Moosavi-Dezfooli, and Frossard 2016); here we use this idea to analyse the effect of noise in the intermediate feature maps of the DNN model.

Quantization optimization

The goal of our paper is to find a way to achieve an optimal quantization result to compress a DNN model. After quantization, under a controlled accuracy penalty, we would like the model size to be as small as possible. Suppose that we have a DNN F with N layers. Each layer i has s_i parameters, and we apply b_i bit-width quantization to the parameters of layer i to obtain a quantized model F'. Our optimization objective is:
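In the form suggested by these definitions, and stated here as a hedged sketch rather than the paper's exact expression, the objective is to minimize the total quantized model size subject to a bound on the accuracy loss:

\min_{b_1,\dots,b_N} \; \sum_{i=1}^{N} s_i \, b_i \quad \text{s.t.} \quad \mathrm{acc}(F) - \mathrm{acc}(F') \le \epsilon,

where acc(·) denotes the model prediction accuracy and ε is the tolerated accuracy penalty. The per-layer measurement proposed above is what makes such a constraint tractable, since it estimates the accuracy effect of each b_i without exhaustively evaluating every bit-width assignment.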

Journal:
  • CoRR

Volume: abs/1712.01048

Pages: -

Publication date: 2017